Abstract

We suggest a geometric framework for modelling similarity
search in large and multidimensional data spaces of general
nature, formed by the concept of the similarity workload,
which is a probability metric space $\Omega$ (query domain)
with a distinguished finite subspace $X$ (dataset), together
with an assembly of concepts, techniques, and results from
metric geometry. As some of the latter are being currently 
reinvented by the database community, it seems desirable to 
try and bridge the gap between database research and the 
relevant work already done in geometry and analysis. 

1. Introduction 

Mathematical modelling of similarity search is still very 
much in its infancy, and the modest aim of this paper is 
to spot a few mathematical structures clearly emerging in 
the present practice of similarity search and point in the 
direction of some relevant and well-established concepts, 
ideas, and techniques which belong to geometric and func- 
tional analysis and appear to be relatively little known in the 
database community. 

For the most part, there is a long way to go before (and 
if) the outlined ideas and methods are put into a workable 
shape and made relevant to the concrete needs of similarity 
search. This is certainly the case with the phenomenon of 
concentration of measure on high-dimensional structures, 
which might potentially have the greatest impact of all on 
both theory and practice of similarity search, through of- 
fering a possible insight into the nature of the curse of di- 
mensionality. However, in some other instances the estab- 
lished mathematical methods can be used directly and, we 
believe, most profitably, in order to improve the existing
algorithms for similarity retrieval based on ad hoc, though
often highly ingenious, mathematical techniques. This could
well be the case with the technique of metric transform as
applied to histogram indexing for image search by colour
content. Even here the gap separating theory and practice
of database research (discussed in [14] in a highly colourful
manner) has yet to be bridged. Nevertheless, as the size
of datasets grows exponentially with time, attempts to un- 
derstand the underlying, very complex, geometry of sim- 
ilarity search through joint efforts of mathematicians and 
computer scientists seem to have no credible alternative. 

2. Data sets and metrics 
2.1. Data structures 

An indisputable fact, though one often downplayed in
theoretical analysis, is that a query point, $x^*$, need not
belong to an actual dataset, $X$. This is why we make a
distinction between the collection of query points (domain)
and the actual dataset. A domain is a metric space,
$\Omega = (\Omega, \rho)$, whose elements are query points,
and the metric $\rho$ is the dissimilarity measure. An actual
dataset (or instance), $X$, is a finite metric subspace of
$\Omega$.
We have borrowed the concept of a domain from [11],
where it is defined as 'a set such as $\mathbb{R}^d$ together
with methods such as $x$-component, order, etc.' Although
dissimilarity measures not satisfying the triangle inequality
are sometimes considered [^|], metrics, or else pseudometrics
(for which the condition $(\rho(x, y) = 0) \Rightarrow (x = y)$
is dropped), are of overwhelming importance. Many metric
spaces appearing in this context are non-Hilbert, e.g. the
Hamming cube $\{0, 1\}^n$ equipped with the string edit
distance, or $\ell_1$. In fact, it is hardly possible to come
up with any a priori restrictions that distinguish metric
spaces 'relevant for applications' from those that are not.



2.2. Similarity queries and indexing 

Two major types of similarity queries are range queries
and nearest neighbour queries. Let $x^* \in \Omega$ be the
query centre. A generic $\varepsilon$-range query is of the
form: given an $\varepsilon > 0$, find all $x \in X$ with
$\rho(x, x^*) < \varepsilon$. A generic $k$-nearest neighbour
($k$-NN) query is of the form: given a natural $k$ and an
$x^* \in \Omega$, find $k$ elements of $X \setminus \{x^*\}$
closest to $x^*$ in the sense of metric $\rho$.

A $k$-NN query can be reduced to a series of range queries
with varying radii $\varepsilon > 0$ chosen by some sort of a
binary procedure.
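As an illustration of this reduction, the following minimal Python sketch bisects on the radius until a range query returns at least $k$ points. It is only one possible reading of 'some sort of a binary procedure': the function names, the linear-scan range query, the starting radius `hi` and the iteration count `iters` are all assumptions made for the example, not part of the framework.

```python
import math

def range_query(data, query, eps, dist):
    """Return all points of `data` at distance < eps from `query` (linear scan)."""
    return [x for x in data if dist(x, query) < eps]

def knn_by_range_queries(data, query, k, dist=math.dist, hi=1.0, iters=60):
    """Hypothetical k-NN search built only from range queries."""
    if k > len(data):
        raise ValueError("k exceeds the size of the dataset")
    # Grow the upper radius until the range query captures at least k points.
    while len(range_query(data, query, hi, dist)) < k:
        hi *= 2.0
    lo = 0.0
    # Bisection: keep the invariant that radius `hi` yields at least k hits.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if len(range_query(data, query, mid, dist)) >= k:
            hi = mid
        else:
            lo = mid
    hits = range_query(data, query, hi, dist)
    # Ties at the boundary may give more than k hits; keep the k closest.
    return sorted(hits, key=lambda x: dist(x, query))[:k]
```

In practice the linear scan would be replaced by an indexed range query; the sketch only shows how the radius is located.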

A general index structure [11] on a dataset $X$ is just any
cover $\Gamma$ of $X$ (that is, $\bigcup \Gamma = X$) with a
collection of blocks $A \in \Gamma$ of uniform and
'manageable' size.

Definition 2.1 A hierarchical tree index structure on a set
$X$ is a family $\Gamma = \{A_t\}_{t \in T}$ of subsets of $X$
(blocks) indexed with elements (nodes) of a finite tree
$T = (T, \le)$ in such a way that the following are satisfied.

1. For the root $0 \in T$, $A_0 = X$.

2. If $t \in T$ and $t_i$ are descendants of $t$, then the
sets $A_{t_i}$ cover $A_t$.

This scheme apparently includes as particular cases k-d 
tree, metric tree, vp-tree, gh-tree, GNAT, M-tree, pyramid 
technique, etc. (See e.g. [jj], ||, [HJ [22| and references 
therein.) 

2.3. Processing range queries 

To process a range query of radius $\varepsilon > 0$ centred
at $x^* \in \Omega$ using a hierarchical tree indexing
structure $\Gamma$ on $X$, one employs the following
algorithm. Below

\[ O_\varepsilon(A) = \{x \in \Omega : \rho(x, a) < \varepsilon \text{ for some } a \in A\} \]

is the $\varepsilon$-neighbourhood of $A$ in $\Omega$.

1. Set $t = 0$.

2. Set $A = A_t$.

3. If $A$ is a leaf, use exhaustive search to find all
$x \in A$ with $\rho(x, x^*) < \varepsilon$.

4. Else, if it can be certified that
$x^* \notin O_\varepsilon(A)$, prune the sub-tree descending
from the internal node $t$.

5. Else, for every $i = 1, 2, \ldots, k$, where
$t_1, t_2, \ldots, t_k$ are descendants of $t$, do: set
$t = t_i$ and go to 2.

If we had means to certify at each step that
$x^* \in O_\varepsilon(A)$, then the algorithm could be
modified so as to return one of the $\varepsilon$-neighbours
in time $O(h)$, where $h$ is the height of the tree $T$
(typically, $O(\log n)$, with $n$ the number of objects in
the dataset), by traversing down one branch and selecting at
each step an arbitrary node $t$ such that $A_t$ and
$O_\varepsilon(x^*)$ have a non-empty intersection.

Unfortunately, even if the possibility of such certification
has been (implicitly) assumed by some authors, it is
computationally infeasible. Instead, the following technique
is employed.

A function $f\colon \Omega \to \mathbb{R}$ is called
1-Lipschitz if

\[ |f(x) - f(y)| \le \rho(x, y) \]

for each $x, y \in \Omega$. For each $t \in T$, choose
computationally inexpensive 1-Lipschitz functions $f_t$ and
numbers $a_t, b_t \ge 0$ such that
$f_t(A_t) \subseteq [a_t, b_t]$. If
$x \in O_\varepsilon(A_t)$, then the 1-Lipschitz property of
$f_t$ implies
$f_t(x) \in (a_t - \varepsilon, b_t + \varepsilon)$. Thus,
the property
$f_t(x) \notin (a_t - \varepsilon, b_t + \varepsilon)$ is a
certificate for $x \notin O_\varepsilon(A_t)$. Yet, the
condition $f_t(x) \in (a_t - \varepsilon, b_t + \varepsilon)$
does not allow one to make a conclusion about whether or not
$x$ is in $O_\varepsilon(A_t)$, and every such node has to be
followed through.

An example of this kind is the distance function $d_t$ from
some $v_t \in \Omega$, called a vantage point (for the node
$t$):

\[ d_t\colon x \mapsto \rho(v_t, x), \]

where $a_t = \min\{d_t(x) : x \in A_t\}$ and
$b_t = \max\{d_t(x) : x \in A_t\}$ are the corresponding
precomputed constants.
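The range query algorithm with vantage-point certificates can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a description of any particular published index: the class `Node`, the choice of the first point of a block as vantage point, the median split into two children and the parameter `leaf_size` are all hypothetical, and the metric is taken to be Euclidean (`math.dist`).

```python
import math

class Node:
    """One node t of a hierarchical tree index; `block` plays the role of A_t."""
    def __init__(self, block, leaf_size=4):
        self.block = block
        self.vantage = block[0]                      # v_t, an arbitrary choice
        dists = [math.dist(self.vantage, x) for x in block]
        self.a, self.b = min(dists), max(dists)      # d_t(A_t) lies in [a_t, b_t]
        self.children = []
        if len(block) > leaf_size:
            # Split the block by distance to the vantage point (assumed rule).
            order = sorted(range(len(block)), key=dists.__getitem__)
            half = len(block) // 2
            self.children = [
                Node([block[i] for i in order[:half]], leaf_size),
                Node([block[i] for i in order[half:]], leaf_size),
            ]

def range_search(node, query, eps, found=None):
    """Collect all points of the indexed dataset within distance < eps of query."""
    found = [] if found is None else found
    if not node.children:
        # Step 3: a leaf is searched exhaustively.
        found.extend(x for x in node.block if math.dist(x, query) < eps)
        return found
    value = math.dist(node.vantage, query)           # d_t(x*)
    if not (node.a - eps < value < node.b + eps):
        # Step 4: certificate that x* is outside the eps-neighbourhood of A_t.
        return found
    for child in node.children:
        # Step 5: no certificate, so every child has to be followed through.
        range_search(child, query, eps, found)
    return found
```

A call such as `range_search(Node(points), query, 0.1)` returns the same points as a linear scan; the certificate only saves work and never discards a true hit.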

In fact, every subset $A \subseteq X$ admits an exact
1-Lipschitz certification function, namely the distance from
$A$:

\[ d_A\colon x \mapsto d(x, A) = \inf\{\rho(x, a) : a \in A\}, \]

with constants $a = b = 0$. However, the function $d_A$ is
normally far too expensive computationally to be used.

3. Changing the distance 
3.1. The first complexity issue 

Processing a similarity query requires a large number of 
computations of the values of certification functions, typ- 
ically distances between points. Often performing even a 
single computation of the value of the original dissimilarity 
measure, p, is time-consuming. This is why the following 
technique is often applied — so often, in fact, that we con- 
sider it to form a major component of the abstract geometric 
framework. 

1. The 'exact' dissimilarity measure $\rho$ on $\Omega$ is
replaced with a computationally cheaper distance $d$.

2. For given $x^* \in \Omega$ and $\varepsilon > 0$ one
chooses a $\delta > 0$ such that for every $x \in X$, the
condition $\rho(x, x^*) < \varepsilon$ implies
$d(x, x^*) < \delta$. Now, instead of processing the
$\varepsilon$-range query in $(\Omega, \rho, X)$ centred at
$x^*$, one processes the $\delta$-range query in
$(\Omega, d, X)$ centred at $x^*$.

3. For each returned $x \in X$, the condition
$\rho(x, x^*) < \varepsilon$ is verified and the false hits
discarded.






Dimensionality reduction and the projection search
paradigm (see e.g. [13]) are examples of the above technique.
If $\Omega$ is a metric subspace of a high-dimensional
Euclidean space $\ell_2^N$ (that is,
$\rho(x, y) = \|x - y\|_2$, where $\|\cdot\|_2$ is the
Euclidean norm) and $\pi$ denotes the projection onto a
Euclidean subspace $\ell_2^n \subset \ell_2^N$ of a lower
dimension $n < N$, then $d$ is defined by the formula

\[ d(x, y) = \|\pi(x) - \pi(y)\|_2. \]

More generally, this is the case with every distance $d$ used
for prefiltering [17]. In this case, $d(x, y) \le \rho(x, y)$
for all $x, y$.
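The three-step technique can be illustrated for the projection case by the following Python sketch. It assumes, purely for the example, that $\pi$ keeps the first `n` coordinates of a point in $\mathbb{R}^N$; since then $d \le \rho$, one may take $\delta = \varepsilon$. The function names are hypothetical.

```python
import math

def project(x, n):
    """Orthogonal projection onto the first n coordinates (assumed choice of pi)."""
    return x[:n]

def filter_and_refine(data, query, eps, n):
    """Process an eps-range query via the cheaper distance d(x, y) = |pi(x) - pi(y)|."""
    def d(x, y):
        return math.dist(project(x, n), project(y, n))
    delta = eps  # admissible because d(x, y) <= rho(x, y) for a projection
    # Step 2: delta-range query with respect to the cheap distance d.
    candidates = [x for x in data if d(x, query) < delta]
    # Step 3: verify the exact distance rho and discard the false hits.
    hits = [x for x in candidates if math.dist(x, query) < eps]
    false_hits = len(candidates) - len(hits)
    return hits, false_hits
```

The second return value counts the candidates that passed the prefilter but failed the exact test, which is the quantity discussed later under the name of false hits.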

3.2. Metric transform 

A rich source of new metrics $d$ on $X$ leading to the same
nearest neighbour graph [0 as the original metric $\rho$ is
the classical construction of metric transform [||. Let
$(X, \rho)$ be a metric space and let
$F\colon \mathbb{R}_+ \to \mathbb{R}_+$ be a concave
non-decreasing function satisfying $F(0) = 0$. A metric
transform of $X$ by means of $F$ is a pair
$F(X) = (X, F(\rho))$, where $F(\rho)$ is a metric on $X$
defined by $F(\rho)(x, y) = F(\rho(x, y))$.

The theory of metric transform is fairly advanced. Often 
the metric transform of a non-Euclidean metric space turns 
out to be Euclidean and therefore computationally simple. 
At the same time, the metric transform itself can be per- 
formed at the database population stage. 
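A small numerical check may make the construction more concrete. The Python sketch below applies the concave function $F(t) = \sqrt{t}$ to a metric, tests the triangle inequality on a finite sample, and compares nearest neighbours before and after the transform. The helper names and the tolerance `tol` are assumptions of the example; the convex function $t \mapsto t^2$ is included only as a contrast, to show that concavity matters.

```python
import math
from itertools import permutations

def transform(rho, F):
    """Return the transformed distance F(rho)(x, y) = F(rho(x, y))."""
    return lambda x, y: F(rho(x, y))

def satisfies_triangle_inequality(points, d, tol=1e-12):
    """Check d(x, z) <= d(x, y) + d(y, z) on all triples of distinct points."""
    return all(d(x, z) <= d(x, y) + d(y, z) + tol
               for x, y, z in permutations(points, 3))

def nearest(points, x, d):
    """Nearest neighbour of x among the other points with respect to d."""
    return min((p for p in points if p != x), key=lambda p: d(x, p))

points = [0.0, 1.0, 3.0, 7.0, 12.0]
rho = lambda x, y: abs(x - y)
sqrt_rho = transform(rho, math.sqrt)              # concave F: a metric transform
square_rho = transform(rho, lambda t: t * t)      # convex F: not a metric here
```

On this sample `sqrt_rho` passes the triangle inequality test and yields the same nearest neighbours as `rho`, while `square_rho` fails the test on the triple 0, 1, 3.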

3.3. Example: quadratic distance 

If $C = \{c_1, \ldots, c_n\}$ is a finite set, then a
histogram on $C$ is an element of the convex hull of $C$,
which we will denote by $P(C)$, that is, a linear combination
$\sum_{i=1}^n \lambda_i c_i$, $\lambda_i \ge 0$,
$\sum_{i=1}^n \lambda_i = 1$. An example we will have in mind
is that of a colour histogram, showing the colour content of
an image. Here $C$ is a colour space, which is typically a
convex subset of a low-dimensional Euclidean space equipped
with the induced distance, $\rho_C$, and in practice replaced
with a finite metric subspace through a suitable colour
segmentation procedure. Histograms over $C$ are exactly the
distributions of image functions taking values in $C$. The
most natural distances on the space of histograms, $P(C)$,
are probability metrics JT^], in particular the well-known
Kantorovich distance, defined for each
$\mu_1, \mu_2 \in P(C)$ by:



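\[ \rho(\mu_1, \mu_2) = \inf\left\{ \sum_{i,j=1}^{n} \lambda_{ij}\, \rho_C(c_i, c_j) : \lambda_{ij} \ge 0,\ \mu_1 - \mu_2 = \sum_{i,j=1}^{n} \lambda_{ij} (c_i - c_j) \right\}. \]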

The Kantorovich metric has the following 'universal
property.' Recall that a map is affine if images of segments
of straight lines are again such.

Proposition 3.1 Every non-expansive mapping $f$ from a
finite metric space $C$ to a normed space $E$ extends to a
unique affine mapping $\bar{f}\colon P(C) \to E$, which is
1-Lipschitz with respect to the Kantorovich distance.

The Kantorovich distance is computationally expensive.
The QBIC project [10] employs instead the so-called
quadratic distance, which is Euclidean and obtained by
means of metric transform (though neither fact was realized
by its inventors, and full advantage of them was never taken).

A quadratic distance [10], $d$, on the convex hull $P(C)$ of
a finite set $C = \{c_1, c_2, \ldots, c_n\}$ is the distance
determined by the inner product

\[ \langle x, y \rangle = x A y^t \tag{2} \]

on the linear space spanned by $C$, where $A$ is a symmetric
$n \times n$ matrix satisfying a certain positive
(semi)definiteness condition. Every mapping $f$ from $C$ to a
Hilbert space $\mathcal{H}$ extends to an affine map
$\bar{f}\colon P(C) \to \mathcal{H}$, and the distance
$d(x, y) = \|\bar{f}(x) - \bar{f}(y)\|$ is easily verified to
be quadratic. Moreover, one can prove that every quadratic
distance on $P(C)$ is obtained in this way. If now $C$ is
equipped with a metric and the mapping
$f\colon C \to \mathcal{H}$ is an isometric embedding of some
metric transform of $C$, then one obtains quadratic distances
of the type used in the QBIC project [10]. One of the two
main distances of this kind used in [10] was determined by
the matrix $a_{ij} = 1 - d_{ij}$, where $d_{ij}$ are
normalized Euclidean distances between elements of the colour
space $C$. Using [§], one can prove that this distance is
obtained (up to a scalar multiple) in the above way via
applying to the Euclidean colour space the metric transform
$F(t) = \sqrt{t}$. In addition, $C$ with this distance is
contained in the unit sphere of a Euclidean space.
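For concreteness, here is a minimal Python sketch of a quadratic distance between two histograms, with the matrix $a_{ij} = 1 - d_{ij}$ built from Euclidean distances between colours. Normalizing by the largest pairwise distance is an assumption made for the example (the text only says 'normalized'), and the function names are hypothetical.

```python
import math

def similarity_matrix(colours):
    """Matrix a_ij = 1 - d_ij, with d_ij scaled by the largest pairwise distance."""
    d_max = max(math.dist(c, e) for c in colours for e in colours)
    return [[1.0 - math.dist(c, e) / d_max for e in colours] for c in colours]

def quadratic_distance(x, y, A):
    """Distance sqrt((x - y) A (x - y)^t) between histograms x and y."""
    z = [xi - yi for xi, yi in zip(x, y)]
    n = len(z)
    q = sum(z[i] * A[i][j] * z[j] for i in range(n) for j in range(n))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(q, 0.0))

# Three colours at the vertices of an equilateral triangle with side one.
colours = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3.0) / 2.0)]
A = similarity_matrix(colours)
```

With these three equidistant colours the matrix is (up to rounding) the identity, so the quadratic distance between the pure histograms `(1, 0, 0)` and `(0, 1, 0)` is close to $\sqrt{2}$, a scalar multiple of $\sqrt{d_{12}} = 1$.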



4. Geometry vs complexity 

4.1. Measure concentration phenomenon 

From now on we will equip the query domain $\Omega$ with a
probability Borel measure, $\mu$. (That is, $\mu$ is a
sigma-additive measure with $\mu(\Omega) = 1$, defined on all
sets that can be obtained from open balls through countable
unions, intersections, and complements.) We will think of
$\mu$ as reflecting the query distribution.

The quadruple $(\Omega, \rho, \mu, X)$ will be called a
similarity workload.

Recall that a pair formed by a metric space $(\Omega, \rho)$
and a probability measure $\mu$ on it is called a probability
metric space. The concentration function,
$\alpha = \alpha_\Omega$, of a probability metric space is
defined by

\[ \alpha_\Omega(\varepsilon) = 1 - \inf\left\{ \mu(O_\varepsilon(A)) : A \text{ is Borel},\ \mu(A) \ge \tfrac{1}{2} \right\} \]

for each $\varepsilon > 0$, and $\alpha_\Omega(0) = 1/2$. It
is a decreasing function in $\varepsilon$. If $\alpha$
decreases sharply, then most points of $\Omega$ are close to
every subset $A \subseteq \Omega$ containing at least a half
of all points. Most 'naturally occurring' high-dimensional
probability metric spaces have sharply decreasing
concentration functions. High-dimensional spheres, balls,
Hamming cubes, Euclidean cubes, Euclidean spaces equipped
with the Gaussian measure, groups of unitary matrices and
numerous other objects all have concentration functions not
exceeding $C_1 \exp(-C_2 n \varepsilon^2)$, where $n$ is the
dimension and $C_1, C_2 > 0$. This observation is known as
the phenomenon of concentration of measure on
high-dimensional structures ||} |TJ, [19]. Very large random
structures, such as spin glasses, also exhibit this sort of
behaviour [20].

Assuming a typical multidimensional dataset to have a 
sharply decreasing concentration function allows one to ex- 
plain at least some aspects of the dimensionality curse. At 
the same time, such an assumption on the geometry of data 
is much broader than that of uniformity and independence 
type (cf. [^2[). For an approach based on the concept of 
query instability, proposed in [g], we refer the reader to our 
note [p3|. (This is why we do not discuss the paper |Qj 
here.) 

Notice that in some concrete large datasets the dis- 
tribution density of the distance functions (which are 1- 
Lipschitz) is known to sharply peak near one value. And 
this is exactly the kind of behaviour one would expect in 
the presence of concentration property, in view of the fol- 
lowing well-known result. 

Proposition 4.1 Let $f\colon \Omega \to \mathbb{R}$ be a
1-Lipschitz function, and denote by $M$ a median of $f$, that
is, a real number with
$\mu(\{x \in \Omega : f(x) < M\}) = \mu(\{x \in \Omega : f(x) > M\})$.
Then for every $\varepsilon > 0$,
$\mu(f^{-1}(M - \varepsilon, M + \varepsilon)) \ge 1 - 2\alpha(\varepsilon)$.
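The effect described by the proposition can be observed in a small Monte Carlo experiment. The Python sketch below samples the Hamming cube $\{0, 1\}^n$ uniformly and looks at the 1-Lipschitz function 'normalized Hamming distance to the zero string'; it then estimates the fraction of points whose value lies within $\varepsilon$ of the sample median. The function name, sample size and seed are assumptions of the example, and the experiment illustrates the statement rather than proving anything.

```python
import random
import statistics

def fraction_near_median(n, eps, samples=2000, seed=0):
    """Estimate mu(f^{-1}(M - eps, M + eps)) on the Hamming cube {0, 1}^n.

    Here f(x) is the normalized Hamming distance from x to the zero string,
    which is 1-Lipschitz for the normalized Hamming metric.
    """
    rng = random.Random(seed)
    values = [bin(rng.getrandbits(n)).count("1") / n for _ in range(samples)]
    median = statistics.median(values)
    return sum(abs(v - median) < eps for v in values) / samples

low_dim = fraction_near_median(16, 0.1)
high_dim = fraction_near_median(1024, 0.1)
```

For a fixed $\varepsilon$ the estimated fraction grows with the dimension: `high_dim` is close to one, while `low_dim` stays well below it.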

What is still missing is a series of computational
experiments allowing one to estimate the concentration
functions of large real datasets.

4.2. False hits and metric entropy 

Within the outlined paradigm, the phenomenon of con- 
centration of measure can contribute towards the curse of 
dimensionality through a massive amount of false hits re- 
turned by similarity search algorithms. 

To illustrate this on a simple example, consider the
projection search paradigm where
$\Omega \subset \mathbb{R}^N$ and $\pi$ is the projection on
the chosen coordinate axis (that is, $n = 1$). It follows
from Proposition 4.1 that for some $x^* \in \Omega$ query
points $x$ with the property
$|\pi(x) - \pi(x^*)| < \varepsilon$ form a set of measure
$\ge 1 - 2\alpha(\varepsilon)$. If the concentration function
of $\Omega$ decreases exponentially with the dimension, it
means that the NN search along a single coordinate will
return all datapoints located in a region of $\Omega$ having
nearly full measure.
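A quick simulation shows how severe this can be. In the Python sketch below the domain is taken, as an assumption for the example, to be the unit sphere in $\mathbb{R}^N$ with the uniform measure, the projection is onto the first coordinate, and the query point is chosen so that its projection equals the median value 0. The function name, sample size and seed are hypothetical.

```python
import math
import random

def fraction_returned(N, eps, samples=1000, seed=0):
    """Estimate the measure of {x : |pi(x) - pi(x*)| < eps} on the unit sphere.

    pi is the projection onto the first coordinate and pi(x*) = 0.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # A normalized Gaussian vector is uniformly distributed on the sphere.
        g = [rng.gauss(0.0, 1.0) for _ in range(N)]
        norm = math.sqrt(sum(v * v for v in g))
        first_coordinate = g[0] / norm
        if abs(first_coordinate) < eps:
            hits += 1
    return hits / samples

low_dim = fraction_returned(3, 0.2)
high_dim = fraction_returned(400, 0.2)
```

In dimension 3 the slab of half-width 0.2 captures roughly a fifth of the sphere, whereas in dimension 400 it captures almost all of it, so the one-coordinate prefilter discards next to nothing.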



To formulate a general result, assume that the query domain
$\Omega$ is compact. (This assumption does not seem to be at
all restrictive.) For an $\varepsilon > 0$ denote by
$N_\Omega(\varepsilon)$ the minimal number of open
$\varepsilon$-balls needed to cover $\Omega$. (Usually
instead of this quantity one considers its base 2 logarithm,
denoted by $\mathcal{H}_\varepsilon(\Omega)$ and called the
$\varepsilon$-entropy of $\Omega$.) Let $\alpha$ denote the
concentration function of the probability metric space
$(\Omega, \rho, \mu)$.

Proposition 4.2 Let $\rho$ and $d$ be two distances on the
same probability space $(\Omega, \mu)$. Then there is a query
point $x^* \in \Omega$ with the following property. Let
$\varepsilon, \delta > 0$ be such that
$(\rho(x, y) < \varepsilon/3) \Rightarrow (d(x, y) < \delta/3)$
for all $x, y$. Then the open ball formed in $(\Omega, d)$ of
radius $\delta$ has measure

»{0 5 {x*)) > 1 - a (| - a-\N ( ^ d) {e/?>)) 

If now the query domain $(\Omega, \rho)$ has a sharply
decreasing concentration function, while the 'capacity' of
the adjusted metric space, $(\Omega, d)$, is low, then for
some query point, $x^*$, every $d$-ball centred at $x^*$ and
containing the $\rho$-ball of radius $\varepsilon$ would
contain most of the query points and therefore quite probably
a large amount of data points as well. The transition from
$\rho$ to $d$ results in a highly undesirable 'blow-up' of
the mass of the query domain.

The trade-off between the complexity of computing the
distance $d$ and the suitably interpreted 'capacity' of the
space $(\Omega, d)$ could be an important issue in optimising
algorithms for similarity search. Given a similarity workload
$(\Omega, \rho, \mu, X)$, does there exist an approximate
distance $d$ which is computationally simple and yet leaves
ample space for the dataset to fit in $(\Omega, d)$ without
'overcrowding'?

4.3. Example: average colour prefiltering 

In our simplified model the colour space $\mathcal{C}$ will
form an equilateral triangle with side one having $R$, $G$,
$B$ as its vertices and equipped with the Euclidean distance.
It is segmented and replaced with a finite subset $C$
arranged in a hexagonal lattice. By $k$ we will denote the
number of pixels in the image frame. A colour image is an
arbitrary function (picture function) from
$\{1, 2, \ldots, k\}$ to $C$. The set $C^k$ of all colour
images is given the normalized counting measure
$\mu_\sharp$. For every image $x \in C^k$ denote by
$\sigma(x) \in P(C)$ its colour histogram. One can prove that
$\sigma$ is a 1-Lipschitz map. Denote
$P_k(C) = \sigma(C^k)$ and endow $P_k(C)$ with the direct
image measure $\sigma_*(\mu_\sharp)$, that is, the measure of
a subset $A \subseteq P_k(C)$ equals
$|\sigma^{-1}(A)| / n^k$. The probability measure space
$(P_k(C), \rho, \sigma_*(\mu_\sharp))$ forms the query domain
for image query by colour content. It follows from results of
[18] that the concentration function, $\alpha$, of
$(P_k(C), \rho, \sigma_*(\mu_\sharp))$ satisfies



, , 1 f e 2 k 
a{e) < - exp I — — 
The embedding $C \subset \mathcal{C}$ of the finite colour
set into the colour triangle extends, by Proposition 3.1, to
an affine 1-Lipschitz map $i\colon P(C) \to \mathcal{C}$,
which is called the average colour map in [10] and used for
prefiltering in the QBIC project.



Using the same technique as in Proposition 4.2, one can show
that if $x^*$ is any colour histogram whose average colour is
$\frac{1}{3}(R + G + B)$, then the $\varepsilon$-range query
by colour content, preprocessed using the average colour
distance, will return all images contained in a region of
$C^k$ having measure at least

\[ 1 - 2\exp\left(-\frac{\varepsilon^2 k}{8}\right). \]



For example, if $k = 100 \times 100$ and $\varepsilon = 0.1$,
the measure of the above region exceeds 0.99999. Notice at
the same time that the area of the corresponding open ball
inside the colour triangle is at most 0.073 of the area of
the triangle.
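The two figures in this example are easy to recompute. The short Python sketch below takes the lower bound to be $1 - 2\exp(-\varepsilon^2 k / 8)$, which is our reading of the displayed bound and should be treated as an assumption, and compares the area of a disc of radius $\varepsilon$ with the area of an equilateral triangle of side one. The function names are hypothetical.

```python
import math

def measure_lower_bound(eps, k):
    """Lower bound 1 - 2 exp(-eps^2 k / 8) on the measure of the returned region."""
    return 1.0 - 2.0 * math.exp(-(eps ** 2) * k / 8.0)

def ball_area_fraction(eps, side=1.0):
    """Area of a disc of radius eps relative to an equilateral triangle of given side.

    This is an upper bound for the part of the disc lying inside the triangle.
    """
    triangle_area = math.sqrt(3.0) / 4.0 * side ** 2
    return math.pi * eps ** 2 / triangle_area

bound = measure_lower_bound(0.1, 100 * 100)
fraction = ball_area_fraction(0.1)
```

Here $\varepsilon^2 k / 8 = 12.5$, so `bound` indeed exceeds 0.99999, while `fraction` is slightly below 0.073: a ball occupying under a tenth of the colour triangle pulls back to almost the whole space of images.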

5. Conclusion 



To quote [11], "What seems to be needed is a kind of
theory of indexability, a mathematical methodology which, 
in analogy with tractability, would evaluate rigorously the 
power and limitations of the indexing techniques in diverse 
contexts." 

This note is a fragment of what might develop into a
geometric theory of indexability and, most importantly, an
invitation to collaborate as well.